AI models can now solve complex programming tasks in hours while still failing at simple everyday questions. According to Andrej Karpathy, that is not a contradiction but a reflection of how progress in AI is uneven across domains.
Karpathy says there are currently two very different perspectives on AI progress. One group has tried the free version of ChatGPT or the voice mode and formed an opinion based on obvious mistakes, weak reasoning, and hallucinations. In his view, however, those older or less capable models no longer reflect the current frontier.
The second group uses the latest professional-grade systems such as OpenAI Codex or Claude Code in technical domains like programming, mathematics, and research. There, Karpathy argues, progress this year has been dramatic: these models can independently refactor entire codebases or identify security vulnerabilities. As a result, the two groups are often talking past each other.
“It is simultaneously true that OpenAI’s free and, in my opinion, somewhat neglected Advanced Voice Mode fails at the dumbest questions in Instagram Reels, while at the same time OpenAI’s most expensive paid Codex model can spend an hour coherently restructuring an entire codebase or finding and exploiting vulnerabilities in computer systems,” Karpathy wrote on X.
Behind that observation is a deeper point: fields such as coding and mathematics, where outcomes can be clearly verified and reinforced through feedback, are currently benefiting far more from AI progress than areas without clean evaluation metrics, such as writing, consulting, or open-ended advice.
Verifiability as the key to progress
Karpathy’s argument touches on one of the central questions in AI research today: can language models develop into a more general intelligence, or can they only be optimized to perform efficiently in specific domains with well-defined feedback loops?
He addressed this structural issue in an earlier essay on what he called the “Software 2.0” paradigm. In that framework, the critical factor is not whether a task can be precisely specified, but whether it can be verified. Only when a system can receive automated feedback, such as right-or-wrong judgments or clear reward signals, can it be effectively improved through reinforcement learning. As Karpathy put it, “the more verifiable a task is, the better it can be automated in this new programming paradigm.”
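The idea can be made concrete with a toy sketch. Below, a model-generated solution is scored by an automated verifier that returns a binary right-or-wrong signal, the kind of clean reward that reinforcement learning can optimize against. The function names and checks are illustrative, not taken from any actual RL framework.

```python
def candidate_sort(xs):
    """Stand-in for a model-generated solution we want to verify."""
    return sorted(xs)

def verify(solution):
    """Run automated checks and return a scalar reward.

    Returns 1.0 if the solution passes every check, else 0.0. This
    binary judgment is the 'verifiable feedback' that makes domains
    like coding and math amenable to reinforcement learning.
    """
    checks = [
        solution([3, 1, 2]) == [1, 2, 3],
        solution([]) == [],
        solution([5, 5, 1]) == [1, 5, 5],
    ]
    return 1.0 if all(checks) else 0.0

reward = verify(candidate_sort)  # a correct solution earns reward 1.0
```

By contrast, tasks like open-ended writing or consulting admit no `verify` function of this kind, which is exactly why, on Karpathy's account, they benefit less from current training methods.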
Last summer, rumors circulated about a possible “Universal Verifier” at OpenAI that could extend reinforcement learning across all areas of knowledge. So far, however, nothing concrete has emerged. Meanwhile, Jerry Tworek, one of the leading figures behind OpenAI’s reinforcement learning strategy, has left the company and recently said on X that “deep learning research is essentially complete.”